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Abstract 

Complex-valued signals are used in the modeling of many systems in engineering and science, hence being of 
fundamental interest. Often, random complex-valued signals are considered to be proper. A proper complex random 
variable or process is uncorrelated with its complex conjugate. This assumption is a good model of the underlying 
physics in many problems, and simplifies the computations. While linear processing and neural networks have been 
widely studied for these signals, the development of complex-valued nonlinear kernel approaches remains an open 
problem. In this paper we propose Gaussian processes for regression as a framework to develop 1) a solution for 
proper complex-valued kernel regression and 2) the design of the reproducing kernel for complex-valued inputs, 
using the convolutional approach for cross-covariances. In this design we pay attention to preserve, in the complex 
domain, the measure of similarity between near inputs. The hyperparameters of the kernel are learned maximizing the 
marginal likelihood using Wirtinger derivatives. Besides, the approach is connected to the multiple output learning 
scenario. In the experiments included, we first solve a proper complex Gaussian process where the cross-covariance 
does not cancel, a challenging scenario when dealing with proper complex signals. Then we successfully use these 
novel results to solve some problems previously proposed in the literature as benchmarks, reporting a remarkable 
improvement in the estimation error. 


Index Terms 

Gaussian processes, regression, proper complex processes, kernel methods. 


I. Introduction 

C OMPLEX-VALUED signals model a vast range of nowadays systems in science and engineering such as 
telecommunications, optics, electromagnetics, and acoustics among others. The main advantage of using 
complex-valued signals is the availability of processing the real and imaginary parts as a single signal. For these 
reasons, complex-valued signal processing is of fundamental interest. 

Signal processing for complex-valued signals has been widely studied in the linear case, see [1] and references 
therein. A proper complex random signal is uncorrelated with its complex conjugate [2], Properness is useful 
because it simplifies computations and in many cases models the underlying physics of the problem at hand. Most 
of the lineal - solutions in the real case admit a proper complex signals version with minor changes [3], For the 
processing of improper signals further development of the linear solutions are needed [3], [1], 

The nonlinear processing of complex-valued signals has been addressed from the point of view of neural networks, 
[4] and, recently, using reproducing kernel Hilbert spaces (RKHS) [5], Some complex kernels have been lately 
proposed for classification [6], regression [7], [8], [9] and mainly for kernel principal component analysis [10], 
Regarding regression, in [11] the authors propose a complex-valued kernel based in the results in [6] and face the 
derivative of cost functions by using Wirtinger’s derivatives. Same kernel is adopted in [7], As discussed later in 
this paper the resulting approach involves properness. Besides, the kernel used is neither stationary nor isotropic, 
and it may suffer from numerical problems. In [9] the authors review the kernel design to improve the previous 
solution with a kernel they denote as independent. The resulting kernel yields also proper complex-valued outputs. 
The kernel is stationary, but again it is not isotropic in the complex-valued input space, as the real and imaginary 
parts of the input are split and fed to different real valued kernels. Hence, these previous works do not report 
results for isotropic and stationary kernels that may better model the underlying physics of some systems. Also, the 
structure of the kernel remains more rigid than needed. These drawbacks make these solutions not powerful enough 
to learn a wide range of systems. The previous approaches have been developed in the framework of kernel least 
squares and for adaptive filtering, although the extension to batch processing is immediate. We bring here Gaussian 
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processes for regression (GPR) as a useful tool to better study this problem of kernel complex-valued nonlinear 
regression [12]. 

GPRs arc well known techniques in machine learning [13], [14]. They have been successfully applied to 
regression, classification and dimensionality reduction. GPs can be interpreted as a family of kernel methods with 
the additional advantage of providing a full conditional statistical description for the predicted variable, where 
hyperparameters can be learned by maximizing the marginal likelihood, avoiding cross-validation. Properties and 
features of random complex-valued Gaussian processes have been widely studied [15], facilitating the extension of 
GPR for the complex-valued case proposed in this paper. The benefits arc twofold. On the one hand we have a novel 
tool for nonlinear regression providing probabilistic outputs. Second, since this Bayesian approach is connected to 
the regularization point of view [16], we prove the structure of the reproducing kernel and study the conditions to 
be valid for proper complex outputs. The solution of GPR in the general complex-valued case is quite involved, 
since the pseudo-covariance is to be taken into account. In the proper scenario the pseudo-covariance cancels, and 
the formulation of Gaussian processes for real-valued signals can be adapted [2]. We describe this new solution, 
providing the mean and the covariance of the complex-valued outputs. We particularly focus on the kernel, to 
improve previous solutions. We derive its structure, being complex valued, where the real paid of the kernel is 
given by the covariance of the real paid of the outputs plus the covariance of the imaginary paid of the outputs, 
while the imaginary paid of the kernel describes the cross-covariance between real and imaginary parts of the 
outputs. We prove that the real and imaginary parts of the kernel can be designed with different features. But 
we conclude that the imaginary paid, in addition to be skew-symmetric, must be constructed to ensure the whole 
covariance to be semi-definite positive, i.e. a reproducing kernel or covariance matrix [17]. We also pay attention 
to the modeling of physical systems, and propose a kernel for the real paid based on the exponential kernel. At this 
point, when measuring similarity between inputs we resort to the absolute value of the complex difference between 
inputs. As a result, the kernel is isotropic and stationary. Other real kernels may be adapted using this metric, to 
better model the physics of the problem at hand. The construction of the imaginary paid is challenging. Conditions 
on this matrix to ensure properness and to be a valid kernel quite limit the degree of freedom in the kernel design, 
and the systems we may model. We resort to the convolution approach [18], [19] to ensure the whole matrix to be 
a reproducing kernel. We propose a model for the cross-covariance that explains a positive and negative correlation 
of the real part of the output with the imaginary one for a positive and negative delay, canceling at origin. Every 
system modeled with complex-valued signals can be rewritten with real-valued ones by splitting the output signal 
into a composite two-dimensional vector, with real and imaginary parts. For the sake of comparison, the multiple 
output learning (MOL) [18], [16] is proposed to solve this composite kernel complex-valued regression problem 
and adapt it to the proper complex case. In this scenario we lose the intuition about complex values, and we cannot 
deal with inputs and outputs easily pairing real and imaginary parts to measure similarity between them. Also, we 
may lose some of the computational benefits of working with complex numbers, such as in the computation of the 
matrix inversion, [20], [21]. However, we prove that both approaches yield equivalent results, gaining some further 
intuition on the kernel properties and structure. 

One of the advantages of GPR is the availability of estimating hyperparameters by maximizing the marginal 
likelihood. Here we face the derivation of a function of complex-valued parameters and variables. We cope with 
this problem by using Wirtinger’s derivatives. To illustrate the good performance of this approach, we include a 
first experiment where we learn a Gaussian process with unknown hyperparameters, where we assume the cross¬ 
covariance does not cancel. Then we compare to the algorithms in [8], [9] using the same scenario than in the 
experiments described in [8] as benchmark. Due to the improvements derived from the proposals in this paper, we 
achieve remarkable good results. At this point we conjecture that whenever we do not know the cause or model 
of the cross-covariance or this cross-covariance is negligible, assuming a real-valued reproducing kernel is a good 
option. 

The rest of the paper is organized as follows. Next section includes the derivation of GPR for proper complex¬ 
valued signals. In Section III we exploit this formulation to study the structure and construction of the kernel, 
compare it to the kernel in the MOL approach, and provide some examples for the kernel. We develop in Section IV 
the optimization procedure to set the kernel hyperparameters applying Wirtinger’s calculus and patterned complex¬ 
valued matrix derivatives [22]. The good performance of this approach is illustrated in Section V, learning a unknown 
process and comparing to previous approaches in the channel equalization problem. We end with conclusions and 
an appendix. 
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The notations used in the paper are as follows. If A is a matrix, (A )^ q is its (l, q ) entry. For a vector, a, aj 
denotes its /-th entry. A T represents the transpose of A and A H its Hermitian. Tr (A) is the trace of A, and det A 
its determinant. To denote the ?'-th sample we use (i), both for a vector, a(i), and a scalar, a(i). a n is a vector 
of n entries while A n is a matrix with n columns. The real and imaginary parts arc denoted by subindex r and 
j, respectively, i.e. a r = 72(a) and aj = X(a), and throughout the text j = 1. To denote the complex Gaussian 

distribution with mean vector p, covariance matrix K and pseudo-covariance matrix K we use A f ^/r. K, K j. IE[-] 
denotes statistical expectation. 

II. Proper Complex Gaussian Process Regression 

GP for regression can be presented as a nonlinear regressor that expresses the input-output relation through 
function /(x), known as latent function, that follows a GP and underlies the regression problem 

V = /( x ) + e, (1) 

where the error, e, in the estimation of the output, y, is modeled as additive zero-mean Gaussian noise. GPs are 
the natural nonlinear Bayesian extension to the linear minimum mean-squared error (LMMSE) and Wiener filtering 
[23], [14], thus, GPs provide the correct approach to solve an MMSE filter non linearly. 

For any input set {(x(i))|i = 1..... n}, where x(i) G C d are complex-valued column vectors of dimension d, 
the latent function in (1) can be designed and evaluated to provide a multidimensional Gaussian complex-valued 
random vector f n = [/(x(l)),...,/(x(n))] T , where /(x(i)) G C. In this paper we focus on both complex-valued 
input and output. The simpler real-valued input and complex-valued output case can be easily solved from the results 
herein. The immediate approach when dealing with complex-valued signals is to consider the real and imaginary 
components as separate real signals [24], for both inputs and outputs. If we wish real and imaginary parts to be 
considered dependent, we may resort to multiple output Gaussian processes to relate them [18], [16]. However, in 
these approaches we lose the complex value notion, specially for the input. 

A complex Gaussian vector is characterized not only by its mean vector //, and covariance matrix K = 

IE [(f n — /7,j(f„ — //) H ], but also by its complementary covariance or pseudo-covariance matrix K = IE [(fn - /r) (fn - aO T ], 
[1], We use the notation J\f ^/r. K, K j for a joint Gaussian probability distribution to emphasize this fact. When 
the complementary covariance matrix is zero, a complex Gaussian random vector is regarded as proper [2], [25]. 
Without loss of generality, we consider here a zero-mean process for the latent function [2], By imposing properness, 

(K)j/ = A:(x(i). x(Z)) = 0 for all i,l. The prior for this proper complex Gaussian process yields 

p(fn|X n ) = AT (0, K, 0) = nri ( l xK exp (-fH K-X ), (2) 

where X n = [x(l),..., x(n)], and (K)^ = A’(x(i), x(()) for all i, l in the input set, being A’(x(i), x(Z)) the covariance 
function. 

In the learning process we condition the output of the GPR for some new observations given the training set 
V = {(x(i),y(i))\i = l,...,n} = {X n ,y„}, where the outputs y n = [y(l),..., y(n)\ T , y(i) € C, for a given set of 
observations X n , are known. We can compute this posterior through the joint distribution of the to-be-estimated 
outputs and the training data y n . This distribution can be constructed by using (2) and the marginal likelihood, 
p (y„|X„ ), as follows. We assume that the additive noise e in (1) follows an i.i.d. proper (circular) complex Gaussian 
distribution with zero mean and variance aj. The samples in the training set arc i.i.d., hence the likelihood for the 
latent function at the training set, p (y n |f n ), is given by a factorized model 

n 

P(yn|fn) = II^)I/( X «)), ( 3 ) 

i =1 

where p(y(z)|/(x(z))) = Af(/(x(f)), aj, 0). Therefore, the likelihood is a proper complex multidimensional Gaus¬ 
sian p (y n |f n ) = A f (f n , aj I„, 0). This likelihood and the prior in (2) yield the marginal likelihood or evidence 

P (y?i|X n ) = j p(y n \f n )p(f n \X n )df n = Af(0,C,0), (4) 

where C = K + aj I n . Note that y n is also proper Gaussian. Furthermore, y n and f„ are cross-proper, as the 
complementary cross-covariance matrix E [yr)f,T] = 0- Hence, y n and f n are jointly proper [1], i.e., the composite 
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complex random vector [y J • fJ ] is Gaussian proper. If we now have a number m of test input vectors X, m = 
[x,(l),..,,x,(m)], the joint distribution of the training outputs y n and f, m = [/(x,(l)),/(x,(m))] T is as 


follows: 


yn 

f 

L »n 


Af 0, 


C K(X n , X. 
K(X. m ,X n ) K(X. m ,X. 


m) 
m ) 



(5) 


where the entry (i,l) of matrix K(X n ,X. m ) is A;(x(i). x.(Z)), and in a similar way are defined K(X #m . X„j and 
K(X, m ,X. m ). From these results, the estimated probabilistic output is the conditional distribution of f. m given 
yn: 

P y n ) = Af i ^f. m > 0) , (6) 


where 


A*f. m = K ( X .m, X n)C Vn, (7) 

Xf.,„ = K(X. ro , X. m ) - K(X. m , X„)C _1 K(X n , X. m ). (8) 

The predictive distribution is also proper Gaussian. The predictive covariance is a quadratic form of the test 
and training inputs, showing that the predictive uncertainties grow with the magnitude of the covariance matrices 
involved, as one would expect for a linear model. 


III. Complex Covariance Functions 

The covariance function is a key tool in GPR: it encodes our assumptions about the function that we wish to learn 
and measures similarity between inputs. There arc some well known examples of covariance functions used for 
real-valued applications, such as the squared exponential or Gaussian covariance function and the inhomogeneous 
polynomial kernel, among others. These kernels assume that training points that are near to a test point should 
be informative about the prediction at that point [13]. The definition of a kernel for complex-valued GPR should 
provide the same measure of similarity between the inputs. 

In the proper complex-valued Gaussian processes framework we can derive the kernel following a more principled 
approach as follows. Consider a zero-mean complex Gaussian vector f„ = f nr + jf„ p with f„, its real paid and f r)| 
its imaginary paid. The covariance matrix K = E [f n f^] is [1]: 

K = Kjt + Kjj + j (Kj r — K r j), (9) 

where K rr and Kjj £ M " xn are the covariance matrices of real and imaginary parts of f„, respectively, and 
K rj =E[f nr f nj T ] =Kj[ G M nxn is the cross-covariance matrix of real and imaginary parts. At this point it is most 
important to notice that K must be a valid covariance matrix for the Gaussian process. Real-valued covariance 
functions can now be used to write out the three real covariance matrices K rr , Kjj and K r j, but they arc interrelated. 
Given two inputs, the complex covariance function of the process is composed by three real covariance functions, 

k’(x,x) = k n {x,x.') + (x, x / ) +j (hr j(x,x) - A’ rj (x,x')) • (10) 

For the particular case of a proper complex Gaussian vector, K rr = K M and K jr = KT = — K,j [1], Therefore, 
in the case of a proper complex Gaussian process, functions A rr (x, x') = /vjjfx. x'), and Aj-j(x. x') must yield either 
a null or skew-symmetric cross-covariance matrix K r| . The covariance matrix yields 

K = K r +jKj, (11) 

where K r = 2K rr and Kj = — 2K r j. Following the guidelines in [16] we may conclude that the proposed kernel 
in (11) is a valid reproducing kernel if it is a covariance matrix for the Bayesian Gaussian process framework. 
Hence, it must be a Hermitian positive semi-definite matrix [17]. K n must be a symmetric and positive semi-definite 
matrix, since the marginals of the joint probability functions of the real and imaginary parts must be also covariance 
matrices as discussed later in this section. It follows that the condition v H Kv > 0 for any v £ C yields, 

v h Kv = v t T K r v r + Vj f K r vj — 2 Vj T K jVj > 0 (12) 

where the first terms to the right of the equality arc greater or equal to cero, since K r is positive semi-definite and 
we used v r Kjv r =v j r KjVj = 0 since Kj is skew-symmetric. In the following subsection we analyze this problem 
by decomposing it into the real and imaginary parts. 
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A. Complex GPR as multiple output GPR 

The complex case may be tackled by mapping the complex value into a two dimensional vector with real and 
imaginary parts, as in [26]. Then two GPs can be learned, one for the real part and another for the imaginary part 
of the output. Either independently or using a MOL or vector scheme [27], [18], [16]. The definition of RKHS 
valued functions in MOL parallels the one in the scalar, where now the reproducing kernel is matrix valued. In our 
problem and for the training set, we have scalar reproducing kernels K rr , K r j, Kj r and Ky, that are block matrices 
in the vector kernel matrix 


K r 


K rr (X n , X n ) K r j(X n , X n ) 
Kj r (X n , X n ) K rr (X n , X n ) 


(13) 


Lor one single test input, we write the mean of the multiple outputs as follows 


2/r 

Vi 


krr(x#,X n ) k r j(x # , X n ) 

p-t 

y r 

_kj r (x # , X n ) kjj(x # ,X n )_ 

'-'R 

_yj_ 


where Cr = Kr + of l2n is the covariance matrix in the multiple output or vector formulation. 
If we assume the output to be proper the model above yields 


Hr 


' k^x., X) k r j(x # ,X)" 


C rr c rj 


yr 

Vi. 


-k rj (x.,X) k rr (x.,X)_ 


-QjC^ 


.yj 


(14) 


(15) 


where, given that the vector covariance matrix must be symmetric, — C r j = Cf. This formulation yields the solution 
in (7) with the kernel as proposed in (11). Similar conclusions can be drawn for the covariance of the output in (8). 
This result for the multiple output case links the resulting kernel in the complex value domain to those defined in 
the MOL case. Since matrix Cr is the covariance matrix of the multiple output, it must be Hermitian and positive 
semi-definite, i.e., equation (12) must hold true for any v. Also, since the marginal for the real or the imaginary part 
of the output is also a Gaussian process, matrices K n and K r j must also be symmetric and positive semi-definite. 

If, in addition, real and imaginary parts are uncorrelated for any pair of inputs, C r j = 0, and (15) yields 

Vr 

V] 

This result can be easily derived from (7) with K = K, = 2K„ in (11). Here, (12) always holds true if K rr is 
positive semi-definite. 

The multiple output formulation is equivalent to the complex-valued one. However, the notion of complex signals 
is lost. Also, if we split the inputs into real and imaginary parts, the design of the covariance matrix becomes a 
harder task if we want to measure similarity between complex signals. In contrast, the complex formulation for 
GPR presented above allows for a kernel design in a natural way, better managing the complex nature of the output 
and the inputs, as follows. 


kpr (x#, X) C n . y r 
k rr (x.,X)C- 1 y J 


B. Design of Covariance Matrices 

In [6], [11] it is proposed a complex-valued Gaussian kernel as an extension of the real Gaussian kernel: 

M x , x 0 = ex P (-( x - x '*) T ( x - x/ *)/t) 

= exp (—(|x r - x'| 2 /7) exp (| Xj + x'| 2 / 7 ) • (cos(2(x r - x') T ( Xj + x')/ 7 ) - j sin(2(x r - x') T (xj + x')/ 7 )) , 

(17) 

where x = x, + jxj, x' = x' + jxj. In [9], the authors propose the so-called independent kernel to improve the 
previous one: 

k in d{x,X ) = Kr (x r , x') + Kr (xj,xj) +j (kr (x r ,xj) “ «R ( x j, x r)) , (18) 

where kr is a real kernel of real inputs. 

In (17) we have that the value of the kernel for (x, x r ) is the complex conjugate of the kernel for (x',x). 
This corresponds to assuming the output is proper with non-null skew-symmetric cross-covariance matrix. This 
kernel measures similarities between real parts while measures dissimilarity between imaginary ones and it is not 
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stationary. It also has an oscillatory behavior. In addition, the exponent in the kernel may easily grow large and 
positive, causing numerical problems, as later discussed in the experiments. These characteristics arc not very useful 
when modeling the underlying physics of many systems. 

The kernel in (18) solves the measure of similarity between inputs by checking for the real and imaginary parts 
independently. At this point, we lose the intuition about the complex nature of inputs. The kernel assumes again that 
the output process to model is proper complex valued, where the imaginary part is non-null and skew-symmetric. 
One of the main drawbacks of this kernel is that it is not isotropic, due to the way real and imaginary parts of the 
inputs have been split in the kernel. For example, if a real exponential kernel is used in (18) as proposed in [9], 
Kip- = a exp(— |x — x'l/./i) for some hyperparameters a and f3, whenever two inputs arc distant enough the kernel 
vanishes except for similar - imaginary parts. For any Xj = xj, the covariance yields the maximum value for Km, 
kjnd ( x . x') = a. In the imaginary part of the kernel we have a similar behavior. 

The proposed kernel in (10) much better adapts to the problem at hand. We may design the real and the imaginary 
parts of the kernel with different structure. And if the cross-covariance is null or negligible we may set the imaginary 
part of the kernel to zero. The conditions for these kernel functions to form a reproducing kernel are given in (12). 
On the other hand, we propose as measurement of similarity or metric the inner product, x H x, used in the complex 
value space as follows. 

1) Real kernels: Assume the cross-covariance cancels and we resort to (16). We do use a proper complex 
Gaussian process to model the system. But we later condition to the training samples. The resulting output within 
the input range defined by the training inputs may behave, locally, quite differently from a proper Gaussian process 
with null cross-covariance. Note that in (16) we have the same matrix multiplying the real and imaginary parts of 
the outputs. If these parts are dependent, the GPR will translate this dependence to the output. Furthermore, if the 
real part of the output has a different variance than the imaginary part, (16) properly scales the estimated output, 
by using the variance of the training outputs. For these reasons, we conjecture that if we know little about the 
cross-covariance between real and imaginary parts or it is negligible, using (16) is a good choice as long as the 
complex nature of the inputs is properly addressed. 

As already discussed, we propose using the inner product of the inputs, x H x, a simple metric in complex numbers, 
to cope with isotropy. The Gaussian kernel yields 

fc G (x,x') = exp (-(x - x') H (x - x'j/y) , (19) 

that it is stationary. Therefore, a suitable choice for /c,t(x, x') = /cjj(x, x') can be Ay;(x. x') in (19) while /c r j(x, x') 
could be set to zero if we assume that the cross-covariance between real and imaginary parts cancels. If any further 
information on the physics of the model is known, other kernels [13] can be derived using this metric. 

2) Complex-valued kernels: The design of the imaginary part of the kernel in (10) is a challenging problem that 
remains open, as it is quite system dependent. Here we propose just one kernel, but the procedure followed could 
help in other designs. The design should met condition (12). This condition is not constructive, i.e. it does not help 
designing the kernel. But it encodes intuitive facts such as the maximum absolute value of the cross-covariance 
being lower or equal to the maximum absolute value of the covariance. On the other hand, the kernel must be able 
to explain the dependencies between real and imaginary parts of the output, if known. And we are restricted to 
skew-symmetric matrices. 

We propose to model a system where the real and imaginary parts of the outputs are correlated for delayed 
points with delay p £ C d . Since the covariance matrix must be skew symmetric, the correlation will be positive (or 
negative) for a delay of p and negative (or positive) for a delay of — p. We must ensure that the kernel corresponds 
to a covariance matrix. We bring here the convolution approach [18], [19]. We model the output of the process as 
the filtering of two independent white Gaussian noises, S T and ,3), and compute the kernels from the filter responses. 
This way we met the condition of being a covariance matrix. The filters are designed to model the proposed system, 
fulfilling the imaginary part of the kernel being skew-symmetric. The convolution process is sketched in Fig. 1 
where we use exponential responses for the filters, 

fi r (x) = fij(x) = u r exp(-(x) H (x)/ 7 ) 

/irj(x) = u rj exp(—(x - /x) H (x - q)/y) 

M x ) = -Aj exp(—(x + q) H (x + pj/y) 


( 20 ) 
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Fig. 1. Convolution system model to design a kernel for proper Gaussian processes. 


and W is some zero-mean stationary Gaussian noise with variance a'f v . For these filters, it can be proved, see the 
Appendix, that up to a multiplying constant the kernel in (10) yields, 


£i(x, x') = (n 2 + u 2 ) exp 


+ ]u r u rj exp - 


27 

(d x - /r) H (d x - /jl) 
2 7 


exp 


(d x + ju) H (d x + n) 
2 7 


( 21 ) 


where d x = x' — x. Note that the constants v r] G M, v r G M and /< G C d could be set according to the problem at 
hand or learned as hyperparameters. Also, note that kernel in (19) is a particularization of (21), by setting v r] = 0. 


IV. Hyperparameters computation 

In order to turn complex GPR into a powerful practical tool we must address the model selection problem, in 
the sense of setting the hyperparameters. A major advantage of GPR is that the hyperparameters can be estimated 
by maximizing the marginal likelihood [13]. The marginal likelihood in (4) yields a zero-mean complex proper 
Gaussian random vector with covariance matrix C, and the log marginal likelihood is 

L = logp (y„|X n ) = —y^C _1 y n - logdet C - n log ir. (22) 

In our model, C = C (6) can be parameterized in terms of a set of hyperparameters, 0,, for the kernel /c(x, x') 
and the variance of the additive noise, er 2 , that can be treated as another hyperparameter. Note that matrix C (6) 
is a covariance matrix, so it is Hermitian and L ( 0 ) is a real function. In the maximization of (22) its derivative is 
needed. However, the Cauchy-Riemann conditions dictate that nonconstant real-valued functions that arc defined on 
complex domains are not holomorphic, so the complex derivative cannot be used. Instead, we must seek formal or 
Wirtinger derivatives [11], [28]. Also, since C (6) is Hermitian, it has a structure or pattern, therefore generalized 
complex-valued matrix derivatives need to be considered as follows [22]. Define 

g(C, C*) = -y^C -1 y n - log det C - nlog7r, (23) 

where C is a matrix with independent components, i.e., not Hermitian. It is immediate to see that if C is substituted 
with C in g(C, C*), the marginal likelihood is produced, i.e., 

L(0) = g( C,C*)' 


C=C(0) 


(24) 



Now, since C is unpatterned, applying the chain rule leads to 


8L 

ddi 


dg(C,& 


ddi 
n n 

EE 

k= 1 1=1 

TV 


C=C(0) 
dg(C, C* 


d(C ) 


Ik 


d(C ) 


Ik 


C=C(9) 


ddi 


(1 

f dg{ c,c*) 

\ T dC^ 


dc 

\ 

c=c ( 0 )J d9i j 


The derivative of the first term of g{ C, C*) in (23) yields 


d 

~dt 





(C T ) _1 (y«y^) T (C T ) _1 . 


The derivative of the second term of of g( C, C *) in (23) yields 

(-logdetc) =-(C T )- 1 . 

Substitution of (26) and (27) in (25) yield 


dL m 
— = TV 
dOi 


'C-Vny^C- 1 - C- 1 ) 


dC\ 

del)' 


( 25 ) 


(26) 


(27) 


(28) 


where now the pattern of matrix C does not yield further simplifications. The term dC/ddi depends on the chosen 
covariance function. 


V. EXPERIMENTS 

A. Non-null Cross-covariances 

We propose the following example where we randomly sampled a proper complex GP, to later learn it. We 
generated samples for both the real and the imaginary parts in [—6,5]. The GP covariance matrix used was the 
one in (21), constructed from the corresponding filters in (20). The kernel hyperparameters were set as 7 = 1.125 
and // = 2 + 2j. Values v T and t; rj were set to 1. The real paid of the sample function is shown in Fig. 2 (top), 
while the imaginary part is included in Fig. 3 (top). Circular complex Gaussian noise with a t = 0.1 was added 
to represent measurement uncertainty and n = 200 training noisy samples were randomly chosen. These samples 
has been depicted as circles in Fig. 2 and Fig. 3 (top). Maximization of the log marginal likelihood in (22) using 
(28) yielded the following estimated values of the hyperparameters: 7 = 1.0857, // = 1.9671 + j 1.8690, and 
a e = 0.0984. Then, we found the mean (7) and variance ( 8 ) of the predictive distribution using those training 
samples and the estimated values of the hyperparameters. The real and the imaginary parts of the predictive mean 
(7) arc depicted in Fig. 2 (bottom) and Fig. 3 (bottom), respectively. The mean squared error of the estimation was 
10\og 10 MSE = —12.52 dB, computed for 6400 inputs. 

To further study the accuracy of the prediction, we include in Fig. 4 and Fig. 5 two slices of the surfaces in 
Fig. 2 and Fig. 3, respectively. In Fig. 4 the real part of the sample function of the process is plotted (dashed line) 
versus the real paid of the input, the imaginary paid of the input was fixed to the value T(x) = 4.4430. The real part 
of the prediction in (7) is depicted in thick red line, along with the grey shaded area that represents the pointwise 
mean plus and minus two times the standard deviation. The blue circles mark the training samples. Similarly, in 
Fig. 5 the imaginary part of the input was fixed to the value T(x) = —0.5696 and the imaginary part of the sample 
function of the process is plotted (dashed line) versus the real paid of the input. Again, the imaginary paid of the 
prediction and the training samples arc depicted. Four instances of the posterior ( 6 ) are plotted in thin blue lines 
in Figs. 4 and 5. 
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Fig. 2. Real part of the output y(x) (top) and real part of the mean estimation (7) (bottom) versus the real and imaginary parts of the input, 
IZ(x) and Z(x), respectively. The training samples are depicted as blue circles. 




Fig. 3. Imaginary part of the output j/(x) (top) and imaginary part of the mean estimation (7) (bottom) versus the real and imaginary parts 
of the input, IZ(x) and T(x), respectively. The training samples are depicted as blue circles. 


B. Null Cross-covariances 

The performance of the proposed proper complex GPR was tested in the context of a nonlinear channel 
equalization task. Two nonlinear channels were considered, as in [11] and [8]. Both channels consisted of a linear 
filter f(n) = (—0.9 + 0.8j) • s(n) + (0.6 — 0.7j) • s(n — 1) and a memoryless nonlinearity. The nonlinearity was 
q{n) = t(n) + (0.1 + 0.15j) • t 2 (n ) + (0.06 + 0.05j) • i 3 (n) for the first case (labeled as soft nonlinear channel ), 
and q(n) = t(n) + (0.2 + 0.25j) • f 2 (n) + (0.12 + 0.09j) • t 3 (n) for the second case (labeled as strong nonlinear 
channel). The input signals had the form s(n) = 0.70(-\/l — p 2 X{n) + j pY(n)), as in [11] and [ 8 ], and X(n) 
and Y (n) were Gaussian random variables. Note that the real and the imaginary parts of the input signals were 
generated independently and, therefore, had null cross-covariances. Also note that the input signals are circular for 
p = l/v/2 and highly noncircular if p approaches 0 or 1. At the receiver end of the channel, the signal q{n) was 
coiTupted by additive white circular Gaussian noise with the SNR set to 16 dB. 

The aim of a channel equalization task is to construct an inverse filter, which acts on the received signal r{t) 
and reproduces the original input signal s{n ) as close as possible. To this end, the inputs to the equalizer were the 
sets of samples x = [r(n + D),r(n + D — 1), ■ ■ ■ , r(n + D — L + 1)] T , where L > 0 is the filter length and D is 
the equalization time delay. 

Experiments were conducted as in [11] and [ 8 ], where L = 5 and D = 2, on a set of 3000 samples of the 
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n{x) 

Fig. 4. Real part of the output (thick black dashed line) and real part of the predictive mean (7) (thick red line) versus 1Z(x) for T(x) = 4.4430. 
Training samples are depicted as blue circles. Four instances of the posterior (6) are plotted in thin blue lines. 



n{x) 

Fig. 5. Imaginary part of the output (thick black dashed line) and imaginary part of the predictive mean (7) (thick red line) versus lZ(x) 
for T(x) = —0.5696. Training samples are depicted as blue circles. Four instances of the posterior (6) are plotted in thin blue lines. 


input signal considering both the circular and the noncircular (p = 0 . 1 ) cases and the described nonlinear channels 
(soft and strong). In all cases the results were averaged over 100 trials where the input signals s(n) and noise 
output were generated randomly. In Figs. 6 to 9 we include the MSE along training samples for several methods 
and the channels and inputs described. As in [11] and [ 8 ] the predicted outputs for the training inputs were used 
to compute the error. The MSE value depicted for each sample is the averaged MSE for the last 100 samples 
as in [29] 1 . For the sake of comparison we depict the results for the NCLMS, the NCKLMS2 in [11] and the 
ACKLMS [ 8 ] algorithms, that use the complex Gaussian kernel in (17). We used the code available in [29] to run 
these algorithms. For the ACKLMS algorithm all the parameters were set to the values described in [ 8 ] (7 = 10 2 
and the step update parameter 1/8). Also, for the NCKLMS2 algorithm and the soft nonlinear channel case the 
parameters were set to the values described in [ 8 ] (7 = 10 2 and the step update parameter 1 / 8 ), whereas for the 
strong nonlinear channel case the parameters were set to the values described in [11] (7 = 5 2 and the step update 
parameter 1/4). The novelty criterion was used for the sparsification of both the NCKLMS2 and the ACKLMS 
algorithms with cfi = 0.15 and 62 = 0.2. In Figs. 8 and 9 we observe stability problems in the learning process 
of both NCKLMS2 and ACKLMS, due to the kernel used. This problem was alleviated by using sparsification. 
We also used the independent kernel (18) with kr being the real Gaussian kernel in the NCKLMS2 approach, as 
proposed in [9]. We labeled this algorithm as NCKLMS2-i. The tunable parameter was set to 7 = 5 2 and the step 
update parameter to 1/8. The novelty criterion was again used for the sparsification. 

Tn the code the average was performed for the last 500 samples. 





11 



Fig. 6. Learning curves for NKCLMS2, NCKLMS2-i, NKCLMS2-G, AKCLMS, the proper complex GPR (CGPR) and the optimized CGPR 
(opt-CGPR) for the soft nonlinear channel equalization problem for the circular input case. 


In this paper we propose the mean (7) of the predictive distribution for the proper complex GPR as the equalizer 
output. We used the kernel in (10) where /c n (x, x')= /cy(x, x') = &g(x, x')/2 with &g(x, x') as in (19), and 
fc r j(x, x') = 0, since the real and the imaginary parts of s(n) had null cross-covariances. Two approaches were 
simulated. In one procedure, which we labeled as opt-CGPR, 1000 input-output samples were randomly chosen to 
tune the kernel hyperparameter 7 so that the best possible results were obtained. The noise variance of in the kernel 
was set to the valued used in the experiments. Then, with those hyperparameter values, for each new input x the 
equalizer predicted the corresponding s(n) as in (7) using all the previous input-output pairs as training set. It can 
be observed through the figures the remarkable good results of this proposal in all the cases, soft or strong nonlinear 
channels and circular or noncircular signals. Also, note that the MSE increases with the number of samples until 
reaches steady estate, as CGPR generalizes for an increasing number of input-output pairs. In a second approach, 
labeled as CGPR, we tried to check the hyperparameter estimation capabilities of the proper complex GPR by 
means of the maximization of the log marginal likelihood (22) using (28). The first 250 samples were used as 
training set to estimate the hyperparameters (7 and oy). Then, with those values for 7 and a; fixed, and for each 
new input x the equalizer predicted the corresponding s(n) as in (7) using all the previous input-output pairs as 
training set. The learning curves for this procedure are depicted in Figs. 6 to 9. The results for this procedure, 
CGPR, not far from that of the opt-CGPR, considerably outperform the NCKLMS2 and ACKLMS and illustrate 
the good estimation of the hyperparameters. 

Finally, we proposed to improve the NCKLMS2 by using the kernels derived in Section III, in particular the 
kernel in (19) with null imaginary part. This algorithm is labeled as NCKFMS2-G. We set all the parameters for 
this algorithm to the same values indicated in [11] (7 = 5 2 and the step update parameter 1/4), and the novelty 
criterion was again used for the sparsification. While the NCKFMS2-i algorithm exhibits improved performance 
compared to the NCKFMS2 or ACKFMS algorithms, the NCKLMS2-G algorithm outperforms all of them with 
a smooth learning curve under all conditions. This proves the proposed complex kernel design procedure to be a 
valuable one. In this equalization problem the cross-covariance between real and imaginary parts of the signals 
to-be-learned, s(n), is null. Therefore, we set the imaginary part of the kernel to zero with much better results. 
Also, the measure of similarity of inputs through the simple norm of the complex difference between inputs endows 
the kernel with useful properties such as isotropy and stationarity, better fitting the underlying model. 

VI. Conclusions 

In this paper we prove that the kernels for complex-valued regression proposed in the literature correspond to 
proper complex outputs. Since proper complex-valued Gaussian processes have been widely studied, we propose 
to resort to the Gaussian processes for regression to study this problem. On the one hand, we end with a new 
solution that it is endowed with the advantages of GPs: a probabilistic output and the availability of estimating the 
hyperparameters by optimization. On the other, we derive an expression for the reproducing kernel in the proper 
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Fig. 7. Learning curves for NKCLMS2, NCKLMS2-i, NKCLMS2-G, AKCLMS, the proper complex GPR (CGPR) and the optimized CGPR 
(opt-CGPR) for the strong nonlinear channel equalization problem for the circular input case. 



Fig. 8. Learning curves for NKCLMS2, NCKLMS2-i, NKCLMS2-G, AKCLMS, the proper complex GPR (CGPR) and the optimized CGPR 
(opt-CGPR) for the soft nonlinear channel equalization problem for the noncircular input case (p = 0.1). 



Fig. 9. Learning curves for NKCLMS2, NCKLMS2-i, NKCLMS2-G, AKCLMS, the proper complex GPR (CGPR) and the optimized CGPR 
(opt-CGPR) for the strong nonlinear channel equalization problem for the noncircular input case (p = 0.1). 


complex case. We conclude that it is complex valued if the cross-covariance of the real and imaginary parts of 
the outputs is not null. The imaginary part of the covariance matrix is skew-symmetric, and the full matrix must 
be semidefinite positive, but the kernel used in the real and imaginary parts need not to be the same. Besides, if 
the cross-covariance cancels, the kernel is real-valued. The MOL point of view bears this out. Therefore, we prove 
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the kernel to exhibit a much more flexible structure than proposed in previous works. We end designing a kernel 
where the use of an appropriate metric yields an isotropic and stationary kernel. The proposed methods developed 
by fully using these key results exhibit a remarkable good performance compared to previous ones. 


Appendix 

Covariance Function 

We follow here a procedure si mi lar to that in [18]. Consider two independent, real, stationary, Gaussian white 
noise processes S T (x) and <Sj(x), where x G C d , producing an output y(x) = U(x ) + W{x), where W(x) is a 
stationary Gaussian white noise, and U (x) is defined by the sum of convolutions 


u (x) = (/l r (x) +j/lrj(x)) *S T (x) + (h r (x) + j/l jr (x)) * 5j (x) 

4 

= ^2 ^™( x ) * (29) 

m=l 

where we have used the following notation: h\(x) = h T (x), ^(x) = j/i r j(x), h$(x) = h T (x), h±{x) = j/ij r (x), 
5i(x) = <52 (x) = S'r(x), and 53 (x) = ^(x) = Sj(x). The covariance of T(x) i s derived as follows: 

C(x a , x b ) = C u(x a , x h ) + crfydab, (30) 

where is the variance of W(x), and 


Cu(x a ,x b )=E[U r (x a )U;(x b )} 

4 

I , 

= IE 


^ ' / h m {oi)S m (x a a)d ot 

m= 1 Jcd 

±[ K(P)S n (.x b -(3)d d (3 
n= l Jcd 
4 4 

= EE 


4 4 ^ 

[ h m (a)h* n (P) 

rn=\ n=\ ^ Jc« 

■ E [S m (x a - a)S n (x b - (3)\ d d ad d f3^ . (31) 

Since S’i(x) = S^x) = S T (x), and 53 (x) = 54 (x) = Sj(x), processes S m (x a — a) and S n (x b — (3) covary 
only if m, n G {1,2} or m, n G {3,4}, and (x a — a) = (x& — /?). In such cases, E [S m (x a — a)S n {x b — f})] = 
6 (a — (x a — Xfe + /?)), and the integrals in (31) yield 


h m {a)K(P) 8 (a - (x a - x b + /3))d d ad d /3 


/c d dc d 


= / hm((3 + (x a — Xf,))/i* (/3)d d /3 

J C d 

= [ h m ((3+d x )h* n (m d p, 

J C d 


(32) 


where d x = x a — x b . Hence, 


2 2 


Cu(x a ,x b ) =^2^2 h rn(/3 + d^)h* n {/3)d d f3 

m= 1 n= 1 ^ d 

4 4 

+ VV I h m (/3 + d x )/r*(/3)d d /3. 

m=3n=3^ Cd 


( 33 ) 
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If we set the kernels to parameterized exponentials as /i*(x) = (j) Ci v t exp(—(x — /x,) IT fx — //,)/S), integrals in (33) 
are as follows 


/ c d 


h m {P + d x )/r* (/3)d/3 d 


= (j) C "*(-j ) Cn V m V n 


exp 


ic d 


{P ~ Vn) (P ~ fJ>n) 


7 


• exp 


(P + d x + d x fim) 


=(j) CT " (—J ) Cn v m v n 


I c d 


■ exp 


7 

exp 
\H/ 


d/3° 


03-£) H Q3-fl) 

0.57 


d/3 d 


(dx H-m + f^n) (d x f^m H“ l^n) 


2 7 


=U)°-H) C " (T-Yvm, 


V 2 J 


■ exp 


(d x H m + /4n)^(d x fi m + fl n ) 

27 


(34) 


where /3 = 0.5(/x n - (d x - 

We propose the following parameter values. For /ii(x) = (13 (x) = /i r (x), we set 17=113 = 17, //1 = 73 = 0 and 
ci = C3 = 0 . For h-2(x) = j/irj(x), we set 72 = 7, 112 = u r j and C2 = 1 . And for /izt(x) = j/tj r (x), we set 74 = —7, 
17 = —n r j and C4 = 1 . By making use of this values and ( 34 ), ( 33 ) yields 


C[/(x a ,X5) — Ct/(d x ) 
firj\ d 2 / d“d x 

= U) ^ exp (-^r 


-i (ny 


) ^rjexp 
+ j(^) fr^rjexp^ 


(d x + /x) H (d x + jU) 

27 

(d x - /i) H (d x - 7 ) 
2 7 


+ (^) d ^e X p (-iA) + (=)7?exp 

(d x + /i) H (d x + / u) 


d X d X 

27 


. /7T7\ rf 

“HtJ ^ rjexp 


, /VT77 2 f d^d x 

+ ( T j ^exp^- — 

= (?) ( 2 ^r 2 + 21$ exp ( - 
+ j 2 iyu rj ( exp ( - 


27 


d X d X 


27 

(d x - ^) H (d x - 7 ) 
2 7 


exp 


(d x + /x) H (d x + 7 ) 
27 


(35) 
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